The primary aim of this project is to analyze the Airbnb data for Sydney, Australia, in order to understand how various predictors influence the rental price of the properties. Prices vary mostly with the type of property, the type of room, the neighborhood, proximity to tourist spots, etc. However, we cannot overlook other variables such as the number of beds, the cleaning fee and the number of nights, as these factors also help guests select a particular Airbnb listing.
We chose the Sydney, Australia Airbnb data because Sydney is an interesting region to study and because the data is sufficiently large, accurate and readily available to be both useful and timely.
In recent years, collaborative platforms such as Airbnb, Uber and Zipcar have become prevalent, presenting a more personalized and budget-friendly way of experiencing the world.
As a potential guest, one might wonder what factors govern the price for a property - whether the listed price is fair according to the facilities provided or the neighborhood it is in.
In a similar manner, the potential owners might want to determine what should be an ideal rental price for the property and how they can improve/upgrade the property to gain more monetary benefits.
Since its humble beginnings, Airbnb has made no secret of its heavy use of data science to build new product offerings, improve its service and capitalize on new marketing initiatives. Similarly, users and hosts rely on data science to gain better economic benefits.
In this study, we will rely on the open Airbnb and New South Wales government data obtained from the following dataset hosted on Kaggle:
Our primary dataset has \(96\) columns and \(36662\) rows of listings data. Of the given variables, we believe the following are among the most important:
- zipcode – the location of the property: close to a beach, a historical location or a business center
- property_type – whether the listing is an apartment, a townhouse or a house
- room_type – whether the listing is for a private room or the entire property
- accommodates – how many people can stay in the property at a time
- beds – how many beds are available to accommodate guests
- bathrooms – the number of bathrooms, which varies with the number of guests visiting
- amenities – internet, parking, washer, etc. included with the listing
- house_rules – what guests are or are not allowed to do during their stay: keeping pets, smoking, having parties, etc.
- cleaning_fee – added on top of the listed price, so it raises the total final price
- number_of_reviews – properties with more reviews are generally booked earlier
- minimum_nights – the minimum number of nights a guest has to pay for to rent the property
The goal of our research is to understand the relation between various parameters of listed properties and their advertised renting price.
Before starting with the actual analysis of data we need to perform some cleaning on the source records, including:
- Conversion of data formats into either a numeric or a categorical domain
- Removal of unnecessary columns (mainly text descriptions)
- Fixing records with values missing in fields selected as potential predictors: removing them or finding a suitable default value
First, we are going to drop all fields that will not be included in any of the candidate models:
Now, we are going to convert types and filter out all records with values missing in any potential predictor for which a default value cannot be picked without a potentially significant effect on the quality of the model:
# Convert column values from strings to numeric representations
airbnb$zipcode = as.factor(strtoi(airbnb$zipcode))
# Strip the "$" sign, the trailing ".00" (dot escaped so it matches a literal ".") and the thousands separator
airbnb$security_deposit = strtoi(str_replace(str_replace(str_replace(airbnb$security_deposit, "\\$", ""), "\\.00$", ""), ",", ""))
airbnb$price = strtoi(str_replace(str_replace(str_replace(airbnb$price, "\\$", ""), "\\.00$", ""), ",", ""))
airbnb$cleaning_fee = strtoi(str_replace(str_replace(str_replace(airbnb$cleaning_fee, "\\$", ""), "\\.00$", ""), ",", ""))
airbnb$bathrooms = as.factor(airbnb$bathrooms)
airbnb$beds = as.factor(airbnb$beds)
airbnb$bedrooms = as.factor(airbnb$bedrooms)
# Create dummy variables based on `amenities` column
airbnb$offers_free_parking = str_detect(airbnb$amenities, "Free parking on premises")
airbnb$offers_breakfast = str_detect(airbnb$amenities, "Breakfast")
airbnb$amenities_tv = str_detect(airbnb$amenities, "TV")
airbnb$amenities_internet = str_detect(airbnb$amenities, "Internet")
airbnb$amenities_bathtub = str_detect(airbnb$amenities, "Bathtub")
# Create dummy variables based on `house_rules` column
airbnb$no_smoking = str_detect(airbnb$house_rules, "no smoking")
# Remove source columns from which dummy variables were created
airbnb = subset(airbnb, select=-c(amenities, house_rules))
airbnb = subset(airbnb, price != 0)
airbnb_complete = completeFun(airbnb, c("price", "beds", "property_type", "bathrooms", "zipcode"))
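The helper `completeFun` used above is not defined anywhere in this report. A minimal sketch of what it might look like follows, assuming it keeps only the rows that have no missing value in the listed columns. The definition and the toy data frame are illustrative assumptions, not the original helper or the Airbnb data.

```r
# Hypothetical definition of completeFun(): keep only the rows that have
# no missing value in any of the columns named in `desired_cols`.
completeFun <- function(data, desired_cols) {
  keep <- complete.cases(data[, desired_cols, drop = FALSE])
  data[keep, , drop = FALSE]
}

# Toy data frame with made-up values (not taken from the Airbnb dataset)
toy <- data.frame(
  price   = c(100, NA, 250, 80),
  beds    = c(1, 2, NA, 1),
  zipcode = c("2000", "2010", "2026", NA),
  notes   = c(NA, "a", "b", "c")
)

# Only `price` and `beds` are checked; NAs in other columns are ignored
toy_complete <- completeFun(toy, c("price", "beds"))
print(toy_complete)
```

Restricting the check to selected columns, rather than calling `na.omit` on the whole data frame, avoids discarding listings that only lack values in fields the models never use.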
Now, we’re just going to fix records where the missing values can be replaced with some default option:
airbnb_complete$security_deposit[is.na(airbnb_complete$security_deposit)] = 0
airbnb_complete$cleaning_fee[is.na(airbnb_complete$cleaning_fee)] = 0
And as a last step, we’ll be converting all character and logical fields into factors:
var_mode <- sapply(airbnb_complete, mode)
var_class <- sapply(airbnb_complete, class)
ind1 <- which(var_mode %in% c("logical", "character"))
airbnb_complete[ind1] <- lapply(airbnb_complete[ind1], as.factor)
After the preliminary cleansing procedure, the number of columns and rows in the main dataset has dropped to \(25\) and \(36424\) respectively. Below are the first 10 records from the listings dataset:
| id | bathrooms | price | minimum_nights | maximum_nights | offers_breakfast | amenities_tv | bedrooms | beds |
|---|---|---|---|---|---|---|---|---|
| 12351 | 1 | 100 | 2 | 7 | TRUE | TRUE | 1 | 1 |
| 14250 | 3 | 471 | 5 | 22 | FALSE | TRUE | 3 | 3 |
| 15253 | 1 | 109 | 2 | 7 | FALSE | TRUE | 1 | 1 |
| 20865 | 2 | 450 | 7 | 365 | FALSE | TRUE | 4 | 4 |
| 26174 | 1 | 62 | 1 | 60 | FALSE | TRUE | 1 | 1 |
| 38073 | 1 | 159 | 2 | 730 | TRUE | TRUE | 0 | 1 |
| 39348 | 1 | 84 | 5 | 1125 | FALSE | FALSE | 1 | 1 |
| 44545 | 1 | 130 | 5 | 365 | FALSE | TRUE | 1 | 1 |
| 45440 | 3 | 700 | 3 | 365 | FALSE | TRUE | 5 | 7 |
| 56842 | 1 | 226 | 2 | 200 | FALSE | TRUE | 2 | 2 |
## Data analysis
We now check whether there is a direct relation between ZIP code and rental price, and which parts of the city are the most and least expensive.
## 0% 25% 50% 75% 100%
## 6.0 80.0 136.0 224.5 14999.0
As the plot above shows, there is a clear relation between a property’s location and its listed price. The most expensive properties are close to the coastline, and the cheapest are far from it.
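The code behind the map is not shown in this report. One way to back the location claim with numbers would be to compare the median price per ZIP code. The sketch below does this on a made-up data frame; the ZIP codes and prices are illustrative assumptions, and only the column names `zipcode` and `price` come from the dataset.

```r
# Made-up listings; ZIP codes and prices are illustrative only
toy <- data.frame(
  zipcode = factor(c("2026", "2026", "2000", "2000", "2150", "2150")),
  price   = c(300, 200, 150, 170, 60, 80)
)

# Median price per ZIP code, sorted from most to least expensive
median_by_zip <- aggregate(price ~ zipcode, data = toy, FUN = median)
median_by_zip <- median_by_zip[order(-median_by_zip$price), ]
print(median_by_zip)
```

The median is used instead of the mean because the price distribution is heavily right-skewed (the quantiles above run from 6 to 14999), so a few extreme listings would otherwise dominate a ZIP code's average.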
ggplot(data = airbnb_complete) + geom_bar (aes(x= property_type), width = 0.3 , fill = "#FF6666")+ggtitle(label = "Total count for various property_types")+theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
There are more apartment rental properties than houses, guest suites or condominiums. Apartments account for 21409 of the Airbnb rentals.
ggplot(data = airbnb_complete , aes(x=property_type, y = price, color= room_type))+ggtitle(label = "Property types in Sydney and their prices") +geom_point()+theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
On further analysis, we found out that most of the properties in Sydney are listed as Entire homes but there are options for the guests to share the rental property with other travelers.
qplot(room_type, data = airbnb_complete, facets = room_type ~., width = 0.1, fill = "#FF6666", main = "Count of various room_types options in rental properties in Sydney")
Guests can choose among three room types in the Airbnb listings: entire homes, private rooms and shared rooms.
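The number of listings per room type can be read directly from a frequency table rather than from the chart. The sketch below shows the idea on a made-up vector of room types. The category labels are assumptions for illustration; on the real data the call would be `table(airbnb_complete$room_type)`.

```r
# Made-up room types; labels are illustrative assumptions
toy <- data.frame(
  room_type = c("Entire home/apt", "Private room", "Entire home/apt",
                "Shared room", "Entire home/apt", "Private room")
)

# Absolute counts and shares per room type
room_counts <- table(toy$room_type)
room_shares <- round(prop.table(room_counts), 2)
print(room_counts)
print(room_shares)
```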
ggplot(data = airbnb_complete , aes(x=room_type, y = price))+ggtitle(label = "Room types vs prices") +geom_point()+theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
‘Entire home’ listings are the most widely available and are comparatively costlier than private and shared rooms.
ggplot(data = airbnb_complete , aes(x=accommodates, y = price, col = room_type))+ggtitle(label = "How price changes with the number of people the rental accommodates") +geom_point()+xlim(2,17)
## Warning: Removed 3654 rows containing missing values (geom_point).
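In this plot, price does not appear to change much with the number of people a listing accommodates, although the correlation matrix further below reports a moderate correlation of 0.50 between the two.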
ggplot(data = airbnb_complete , aes(x=cleaning_fee, y = price))+ggtitle(label = "How much cleaning fee adds to the price of the property") +geom_point()
The cleaning fee is charged in addition to the listed price. It ranges up to $500, depending on the size of the property.
ggplot(data = airbnb_complete , aes(x= availability_30, y = price, col = property_type))+ggtitle(label = " Which properties are available 30 days a month and their prices") +geom_point()
Prices are not much influenced by a property’s 30-day availability.
ggplot(data = airbnb_complete , aes(x= amenities_internet, y = price, col = property_type))+ggtitle(label = "Internet availability and prices by property type") +geom_point()
Internet is one of the most important amenities in rentals, and most Airbnb properties seem to include it at no additional cost.
## Collinearity
Collinearity, often called multicollinearity, arises when predictor (independent) variables in a regression model are linearly related to one another. When predictors in the same model are correlated, they cannot independently predict the value of the dependent variable: they explain some of the same variance, which inflates the standard errors of their coefficients and in turn reduces their statistical significance.
Here, we look for collinearity among the numeric variables in the airbnb_complete dataset, to understand whether the predictors depend on one another when fitting the price model.
airbnb_numeric = dplyr::select_if(airbnb_complete, is.numeric)
library(faraway)
pairs(airbnb_numeric[1:1000,], col = "dodgerblue")
round(cor(airbnb_numeric), 2)
| | id | latitude | longitude | accommodates | price | security_deposit | cleaning_fee | guests_included | minimum_nights | maximum_nights | availability_30 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| id | 1.00 | -0.03 | -0.13 | -0.01 | -0.03 | -0.09 | -0.03 | 0.00 | -0.02 | 0.00 | 0.20 |
| latitude | -0.03 | 1.00 | 0.10 | 0.16 | 0.14 | 0.13 | 0.13 | 0.05 | -0.01 | -0.02 | 0.04 |
| longitude | -0.13 | 0.10 | 1.00 | 0.06 | 0.17 | 0.12 | 0.16 | -0.03 | 0.02 | -0.07 | -0.20 |
| accommodates | -0.01 | 0.16 | 0.06 | 1.00 | 0.50 | 0.39 | 0.57 | 0.46 | -0.01 | 0.05 | 0.05 |
| price | -0.03 | 0.14 | 0.17 | 0.50 | 1.00 | 0.39 | 0.43 | 0.19 | 0.02 | 0.00 | 0.07 |
| security_deposit | -0.09 | 0.13 | 0.12 | 0.39 | 0.39 | 1.00 | 0.55 | 0.17 | 0.05 | 0.01 | 0.07 |
| cleaning_fee | -0.03 | 0.13 | 0.16 | 0.57 | 0.43 | 0.55 | 1.00 | 0.27 | 0.05 | 0.08 | 0.10 |
| guests_included | 0.00 | 0.05 | -0.03 | 0.46 | 0.19 | 0.17 | 0.27 | 1.00 | -0.01 | 0.04 | 0.07 |
| minimum_nights | -0.02 | -0.01 | 0.02 | -0.01 | 0.02 | 0.05 | 0.05 | -0.01 | 1.00 | -0.01 | 0.07 |
| maximum_nights | 0.00 | -0.02 | -0.07 | 0.05 | 0.00 | 0.01 | 0.08 | 0.04 | -0.01 | 1.00 | 0.04 |
| availability_30 | 0.20 | 0.04 | -0.20 | 0.05 | 0.07 | 0.07 | 0.10 | 0.07 | 0.07 | 0.04 | 1.00 |
‘price’ is moderately correlated with ‘accommodates’ (0.50), ‘cleaning_fee’ (0.43) and ‘security_deposit’ (0.39), and essentially uncorrelated with ‘minimum_nights’ (0.02) and ‘maximum_nights’ (0.00). Among the predictors, ‘accommodates’ is itself correlated with ‘cleaning_fee’ (0.57), ‘guests_included’ (0.46) and ‘security_deposit’ (0.39), so these variables carry partly overlapping information.
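Pairwise correlations only reveal collinearity between two variables at a time. A common complementary summary is the variance inflation factor, \(VIF_j = 1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on all the other predictors. The sketch below computes it with base R on simulated data. The variables `x1`, `x2` and `x3` are made up for illustration and this is not the analysis run in this report. The `faraway` package loaded above also exports a `vif` function.

```r
set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)   # built to be strongly correlated with x1
x3 <- rnorm(n)                  # generated independently of x1 and x2
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

# Equivalent shortcut: the diagonal of the inverse correlation matrix
vif_from_cor <- diag(solve(cor(X)))

print(round(vif_manual, 2))
print(round(vif_from_cor, 2))
```

A VIF close to 1 indicates that a predictor is nearly unrelated to the others, while large values (thresholds of 5 or 10 are often quoted as rules of thumb) suggest that its coefficient will be estimated imprecisely.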
## Warning in predict.lm(model_01, newdata = airbnb_complete): prediction from a
## rank-deficient fit may be misleading
##
## studentized Breusch-Pagan test
##
## data: model_01
## BP = 1027, df = 469, p-value <2e-16
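The studentized Breusch-Pagan test reported above checks the constant-variance assumption, and a very small p-value is evidence against it. The code that produced this output is not shown, so the sketch below only illustrates the idea with base R on simulated data. It regresses the squared residuals on the predictors and compares \(n R^2\) of that auxiliary regression with a chi-squared distribution. The variables `x` and `y` are made up, and the auxiliary-regression form is the studentized (Koenker) variant.

```r
set.seed(1)
n <- 400
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)   # noise sd grows with x: heteroscedastic by design
fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the same predictor
e2  <- resid(fit)^2
aux <- lm(e2 ~ x)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary regression
bp_stat   <- n * summary(aux)$r.squared
bp_df     <- length(coef(aux)) - 1          # number of regressors, intercept excluded
bp_pvalue <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(c(BP = bp_stat, df = bp_df, p.value = bp_pvalue))
```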
The base, non-enhanced full additive model has \(498\) coefficients in total. Below are its Q-Q and fitted-versus-residuals plots, which display obvious violations of both the normality and the constant-variance assumptions; addressing these violations is the aim of this analysis project: